Harvesting comparable corpora and mining them for equivalent bilingual sentences using statistical classification and analogy-based heuristics

Abstract

Parallel sentences are a relatively scarce but extremely useful resource for many applications, including cross-lingual retrieval and statistical machine translation. This research explores our new methodologies for mining such data from previously obtained comparable corpora. The task is highly practical since non-parallel multilingual data exist in far greater quantities than parallel corpora, but parallel sentences are a much more useful resource. Here we propose a web crawling method for building subject-aligned comparable corpora from, e.g., Wikipedia dumps and Euronews web pages. The improvements in machine translation are shown on the Polish-English language pair for various text domains. We also tested another method of building parallel corpora based on comparable corpora data. It automatically broadens an existing corpus of sentences from the subject of the corpora, based on analogies between them.